Abstract: In this paper we show that the combination of the
minimum description length principle and an exchangeability
condition leads directly to the use of Jeffreys prior. This
approach works in most cases even when Jeffreys prior cannot
be normalized. Kraft's inequality links codes and distributions,
but a closer look at this inequality demonstrates that this link
only makes sense when sequences are considered as prefixes of
potential longer sequences. For technical reasons only results for
exponential families are stated. Results on when Jeffreys prior
can be normalized after conditioning on an initializing string are
given. An exotic case where no initial string allows Jeffreys prior
to be normalized is given, and some ways of handling such exotic
cases are discussed.

I. Introduction 

A major problem in Bayesian statistics is to assign prior 
distributions and to justify the choice of prior. The minimum 
description length (MDL) approach to statistics is often able 
to overcome this problem, but although MDL may look quite 
similar to Bayesian statistics the inference is different. One of 
the main results in MDL is that Jeffreys prior is asymptotically 
minimax optimal with respect to both redundancy and regret. 
Despite this positive result there are two serious technical 
complications that we will address in this paper. 

The first complication is that in MDL the use of a code
based on Jeffreys prior is normally considered suboptimal
compared to the use of the normalized maximum likelihood
(NML) distribution. Jeffreys prior turns out to be optimal if we
take a more sequential approach to online prediction and coding.
The key idea is to consider extended sequences.

The second complication is that in many important applications
Jeffreys prior cannot be normalized. When Jeffreys
prior cannot be normalized it is often (but not always) the
case that the Shtarkov integral is infinite, so that the NML
distribution does not exist. This problem is often handled by
conditioning on a short sequence of initial data. In Bayesian
statistics this has led to a widespread use of improper prior
distributions, and in MDL it has led to the definition of the
SNML predictor. Our sequential approach will justify the use
of improper Jeffreys priors and describe in which sense the
use of improper Jeffreys distributions is normally preferable
to the SNML predictor.

In the classical frequentist approach to statistics a finite
sequence is considered as a sub-sequence of an infinite sequence.
In Bayesian statistics a finite sequence is normally
considered without reference to longer sequences. In this paper
we take a standpoint in between: we think of a finite
sequence as a prefix of potentially longer finite sequences.
Only in this way can we justify the equivalence between codes
and distributions via Kraft's inequality. In this short paper we
restrict our attention to exponential families to avoid
technical complications related to measurability etc. Despite
this restriction our results cover many important applications,
and the model is still sufficiently flexible to illustrate ideas
that can be generalized to a more abstract setting.

The rest of this paper is organized as follows. In Section
II notation is fixed and some well-known basic results are
stated in the way that we are going to use them. In Section
III we will see that the use of Kraft's inequality is relevant
if we consider short sequences as sub-sequences of longer
sequences. In Section IV we define exponential prediction
systems and we will see how such systems are given by prior
measures on the parameter space and for which sequences
conditional distributions exist. In Section V the optimality
of Jeffreys prior is described and some results on when
conditional distributions exist are stated. These sections are
given in the logical order of reasoning but they can be read
quite independently. In this short note many proofs have been
left out or strongly abbreviated. The paper ends with a short
discussion.

II. Preliminaries 
A. Definitions for exponential families 

The exponential family $\{P_\beta \mid \beta \in \Gamma^{can}\}$ based on the
probability measure $P_0$ is given in a canonical parametrization,

$$\frac{dP_\beta}{dP_0}(x) = \frac{\exp(\beta x)}{Z(\beta)} \qquad (1)$$

where $Z$ is the partition function $Z(\beta) = \int \exp(\beta x)\, dP_0 x$,
and $\Gamma^{can} := \{\beta \mid Z(\beta) < \infty\}$ is the canonical parameter
space. Note that we allow the measure $P_0$ to have both discrete
and continuous components. We let $\beta_{sup} = \sup\{\beta \mid \beta \in
\Gamma^{can}\}$, and $\beta_{inf}$ likewise. The trivial case where $\Gamma^{can}$ has no
interior points is excluded from the analysis. In Equation (1), $\beta x$
denotes the product of real numbers when the exponential
family is 1-dimensional, and a scalar product
when the exponential family has dimension $k > 1$, so that
$\beta$ and $x$ are vectors in $\mathbb{R}^k$. See [1] for more details on
exponential families.

For our problem it is natural to work with extended exponential
families as defined in [2]. For a probability distribution
$Q$ on $\mathbb{R}^k$ the convex support $cs(Q)$ is the intersection of all
convex closed sets that have $Q$-probability 1. The convex core
$cc(Q)$ is the intersection of all convex measurable sets with
$Q$-probability 1. We have $cc(Q) \subseteq cs(Q)$. An extreme
point $x$ in $cs(Q)$ belongs to $cc(Q)$ if and only if $Q(x) > 0$.
In its mean value parametrization the exponential family based
on a measure with bounded support has a natural extension to
$cc(Q)$. In particular $\delta_x$ belongs to the extended exponential
family if $Q$ has a point mass in $x$ and $x$ is an extreme point
of $cs(Q)$.

The elements of the exponential family are also
parametrized by their mean value $\mu$. We write $\mu_\beta$ for the
mean value corresponding to the canonical parameter $\beta$ and
$\hat\beta(\mu)$ for the canonical parameter corresponding to the mean
value $\mu$. Note that we allow infinite values of the mean. The
element in the exponential family with mean $\mu$ is denoted
$P^\mu$. The mean value range $M$ of the exponential family is
the range of $\beta \mapsto \mu_\beta$ and is a subset of the convex core.
We write $\mu_{sup} = \sup M$ and $\mu_{inf} = \inf M$. If $P_0$ has a
point mass at $\mu_{inf} > -\infty$ and the support of $P_0$ is a subset
of $[\mu_{inf}, \infty[$, then the exponential family is extended by the
element $P_{-\infty} = P^{\mu_{inf}} = \delta_{\mu_{inf}}$, and likewise the exponential
family is extended if $Q$ has a point mass in $\mu_{sup} < \infty$ and
the support of $Q$ is a subset of $]-\infty, \mu_{sup}]$. For any $x$ the
distribution $P^x$ is the maximum likelihood distribution.

The covariance function $V$ is the function that maps $\mu \in M$
into the covariance of $P^\mu$. If $M$ has interior points then the
exponential family is uniquely determined by its covariance
function. The Fisher information of an exponential family
in its canonical parametrization is $I_\beta = V(\mu_\beta)$ and the
Fisher information of the exponential family in its mean value
parametrization is $I^\mu = V(\mu)^{-1}$.
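
The relations above can be checked numerically on a small example. The following is a minimal sketch, assuming the Poisson family written as an exponential family based on $P_0$ = Poisson(1), for which $\ln Z(\beta) = e^\beta - 1$, $\mu_\beta = e^\beta$ and $V(\mu) = \mu$. The function names and the finite-difference step are illustrative choices, not part of the paper.

```python
import math

def log_partition(beta, terms=200):
    """ln Z(beta) for P_0 = Poisson(1), summing exp(beta * x) * P_0(x) directly."""
    total = 0.0
    for x in range(terms):
        # P_0(x) = exp(-1) / x!, evaluated in the log domain to avoid overflow
        total += math.exp(beta * x - 1.0 - math.lgamma(x + 1))
    return math.log(total)

def mean_and_variance(beta, h=1e-3):
    """Mean and variance of P_beta from central differences of ln Z."""
    f0 = log_partition(beta)
    fp = log_partition(beta + h)
    fm = log_partition(beta - h)
    mean = (fp - fm) / (2 * h)                # first derivative of ln Z
    variance = (fp - 2 * f0 + fm) / (h * h)   # second derivative of ln Z
    return mean, variance

beta = 0.7
mean, variance = mean_and_variance(beta)
# For the Poisson family both numbers should be close to exp(beta),
# which illustrates I_beta = V(mu_beta) with V(mu) = mu.
print(mean, variance, math.exp(beta))
```

The second derivative of $\ln Z$ is the Fisher information in the canonical parametrization, so the agreement of the last two printed numbers is an instance of $I_\beta = V(\mu_\beta)$.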

For elements of an exponential family we introduce information
divergence as

$$D(x \| y) := D(P^x \| P^y) = \int \ln\left(\frac{dP^x}{dP^y}\right) dP^x .$$

This defines a Bregman divergence on the convex core, and
under some regularity conditions this Bregman divergence
uniquely characterizes the exponential family [4].

B. Posterior distributions 

If the mean value parameter has prior distribution $\nu$ and $x$
has been observed then the posterior distribution has density

$$\frac{d\nu(\cdot \mid x)}{d\nu}(y) \propto \exp(-D(x \| y)).$$

Notation: We use $x^m$ to denote $(x_1, x_2, \dots, x_m)$ and $x_m^n$ to
denote $(x_m, x_{m+1}, \dots, x_n)$. We use $\tau$ as short for $2\pi$.

If a sequence $x_1, x_2, \dots, x_n$ has been observed then the
posterior distribution has density

$$\frac{d\nu(\cdot \mid x^n)}{d\nu}(y) \propto \prod_{i=1}^{n} \exp(-D(x_i \| y)) = \prod_{i=1}^{n} \exp(-D(x_i \| \bar{x})) \cdot \exp(-n D(\bar{x} \| y))$$

where $\bar{x}$ denotes the average of the sequence $x_1, x_2, \dots, x_n$,
and where the equality is of general validity for
Bregman divergences. Since the first factor does not depend
on $y$ we have, for a sequence $x^m$ of length $m$ with average $\bar{x}$,

$$\frac{d\nu(\cdot \mid x^m)}{d\nu}(y) \propto \exp(-m D(\bar{x} \| y)).$$
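
The factorization of the posterior density rests on the identity $\sum_i D(x_i \| y) = \sum_i D(x_i \| \bar{x}) + n D(\bar{x} \| y)$. As a minimal sketch, the identity can be checked numerically for the Poisson family, where $D(x \| y) = x \ln(x/y) - x + y$. The choice of family and the function names are assumptions made for illustration only.

```python
import math

def poisson_divergence(x, y):
    """D(x||y) between Poisson distributions with means x >= 0 and y > 0."""
    log_term = x * math.log(x / y) if x > 0 else 0.0   # convention 0 ln 0 = 0
    return log_term - x + y

def split_divergence(sample, y):
    """Return both sides of the identity for a given sample and parameter y."""
    n = len(sample)
    average = sum(sample) / n
    left = sum(poisson_divergence(x, y) for x in sample)
    spread = sum(poisson_divergence(x, average) for x in sample)
    right = spread + n * poisson_divergence(average, y)
    return left, right

left, right = split_divergence([0, 3, 1, 4, 2], 2.5)
print(left, right)   # the two numbers agree up to rounding
```

Only the term $n D(\bar{x} \| y)$ depends on $y$, which is why the posterior depends on the data through the average alone.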



C. MDL in exponential families 

For some exponential families the minimax regret $C_\infty$ is
finite. See [5] for details about how this quantity is defined.
If $C_\infty$ is finite the minimax regret is attained if we code
according to the NML distribution. In general the optimal code
for $X_1$ will depend on whether the sample size is $n = 1$
or whether $X_1$ is considered as a sub-sequence of $X^n$. In
cases where $C_\infty$ is infinite one may use a conditional version
instead, such as sequential NML (SNML).

Of central importance for our approach is the result of Barron,
Rissanen et al. that if the parameter space of an exponential
family is restricted to a non-empty compact subset of the
interior of the convex core, then the minimax regret is finite
and equal to

$$C_\infty = \frac{k}{2} \ln \frac{n}{\tau} + \ln J + o(1), \qquad (2)$$

where $J$ denotes the Jeffreys integral

$$J = \int_{\Gamma^{can}} (\det I_\beta)^{1/2}\, d\beta = \int_M (\det V(x))^{-1/2}\, dx, \qquad (3)$$

and where $I_\beta$ denotes the Fisher information matrix. Moreover,
the same asymptotic regret (2) is achieved by the Bayesian
marginal distribution equipped with Jeffreys prior. In MDL
this result is often used as the most important reason for using
Jeffreys prior with density $w(\mu) = (\det V(\mu))^{-1/2} / J$, but use
of the NML predictor requires knowledge of the sample size,
and the performance of the SNML predictor will depend on
the order of the observations except if it corresponds to the
use of Jeffreys prior [6].
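
To make formula (2) concrete, the sketch below compares it with the exact minimax regret of the Bernoulli family, where $k = 1$, $V(x) = x(1-x)$ and the Jeffreys integral is $J = \pi$. This is an illustration under stated assumptions: the Bernoulli family with its full parameter space is not restricted to a compact subset of the interior, and the function names are hypothetical.

```python
import math

def bernoulli_nml_regret(n):
    """Exact minimax regret ln sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k), in nats."""
    log_terms = []
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1)
                     - math.lgamma(n - k + 1))
        log_ml = 0.0                      # log maximum likelihood of a sequence with k ones
        if k > 0:
            log_ml += k * math.log(k / n)
        if k < n:
            log_ml += (n - k) * math.log((n - k) / n)
        log_terms.append(log_binom + log_ml)
    top = max(log_terms)                  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def asymptotic_regret(n, k=1, jeffreys_integral=math.pi):
    """(k/2) ln(n / tau) + ln J with tau = 2 pi, i.e. formula (2) without o(1)."""
    tau = 2 * math.pi
    return 0.5 * k * math.log(n / tau) + math.log(jeffreys_integral)

for n in (10, 100, 1000, 10000):
    print(n, bernoulli_nml_regret(n) - asymptotic_regret(n))
```

The printed differences shrink as $n$ grows, which is the content of the $o(1)$ term in (2) for this family.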

If the parameter space is restricted to a non-empty compact
subset of the interior of the convex core (called an ineccsi set
in [5]) the Jeffreys integral is automatically finite, but typically
there is no natural way of restricting the parameter space in
applications, and in most cases the Jeffreys integral is infinite.
It thus becomes quite relevant to investigate what happens if
the parameter space is not restricted to an ineccsi set. To
answer this question, one needs to know when the Jeffreys
integral is finite, and how to handle situations where the Jeffreys
integral is not finite.

D. Exchangeability, sufficiency, and consistency

Prediction in exponential families satisfies the exchangeability
condition that the probability of a sequence does not depend on
the order of the elements. We may also say the predictor is
invariant under permutations of the elements in a sequence.
The importance of this exchangeability condition in MDL was
emphasized in Q, but a related but more important type of 
exchangeability is that the probability of a sequence given a
sub-sequence $x^j$ does not depend on the order of the
observations in the sub-sequence. A stronger requirement is that
the predicted probability of a sequence given a sub-sequence
$x^j$ only depends on the average $\bar{x}$, i.e. the sample average is a
sufficient statistic. We are also interested in consistency of the
system of predictors. A system of predictors is consistent if the
predictions satisfy

$$P(x_{m+1}^{n} \mid x^m) = P(x_{m+1}^{l} \mid x^m) \cdot P(x_{l+1}^{n} \mid x^l)$$

for $m < l < n$. A consistent system of predictors is generated from
predictions of the next symbol given the past symbols.
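
Exchangeability and sufficiency can be illustrated on a small case. The sketch below assumes binary data and the sequential predictor obtained from the Bernoulli family with Jeffreys prior, which predicts a 1 with probability $(k + 1/2)/(t + 1)$ after seeing $k$ ones among $t$ symbols. The function names are hypothetical.

```python
from itertools import permutations

def next_symbol_probability(ones, total, symbol):
    """Prediction of the next symbol; it depends on the past only through the sum."""
    p_one = (ones + 0.5) / (total + 1.0)
    return p_one if symbol == 1 else 1.0 - p_one

def sequence_probability(sequence):
    """Probability of a whole sequence, generated from next-symbol predictions."""
    prob, ones = 1.0, 0
    for t, symbol in enumerate(sequence):
        prob *= next_symbol_probability(ones, t, symbol)
        ones += symbol
    return prob

# Every ordering of the same symbols receives the same probability.
values = {perm: sequence_probability(perm) for perm in permutations((1, 0, 0, 1, 1))}
print(min(values.values()), max(values.values()))
```

The probability of a sequence is built as a product of next-symbol predictions, so the system is consistent by construction, and the equality of the minimum and maximum over all orderings shows exchangeability for this example.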

III. MDL and Kraft's inequality 

We recall that a code is uniquely decodable if any finite
sequence of input symbols gives a unique sequence of output
symbols. It is well-known that a uniquely decodable code
satisfies Kraft's inequality

$$\sum_{a \in A} \beta^{-\ell(a)} \leq 1 \qquad (4)$$

where $\ell(a)$ denotes the length of the codeword corresponding
to the input symbol $a$ and $\beta$ denotes the size of the output
alphabet. The length of a codeword is an integer. Normally
the use of non-integer valued code length functions is justified
by reference to the noiseless coding theorem, which requires
some interpretation of the notion of probability distributions
and their mean values. To emphasize our sequential point of
view we formulate a version of Kraft's inequality that allows
the code length function to be non-integer valued.

Theorem 1. Let $\ell : A \to \mathbb{R}$ be a function. Then the function $\ell$
satisfies Kraft's inequality (4) if and only if for all $\varepsilon > 0$ there
exists an integer $n$ and a uniquely decodable fixed-to-variable
length block code $\kappa : A^n \to B^*$ such that

$$\left| \bar{\ell}_\kappa(a^n) - \frac{1}{n} \sum_{i=1}^{n} \ell(a_i) \right| \leq \varepsilon$$

where $\bar{\ell}_\kappa(a^n)$ denotes the length $\ell_\kappa(a^n)$ divided by $n$. The
uniquely decodable block code can be chosen to be prefix free.
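
One direction of the theorem can be illustrated by rounding up the total length of each block. The sketch below is an illustration under assumptions, not the proof from the paper: it uses a hypothetical three-letter alphabet with real-valued lengths $\ell(a) = -\log_2 p(a)$ and a binary output alphabet. The integer block lengths $\lceil \ell(a_1) + \dots + \ell(a_n) \rceil$ satisfy Kraft's inequality, so a prefix-free block code with these lengths exists, and the per-symbol length is within $1/n$ of the average of $\ell$.

```python
import math
from itertools import product

def block_code_lengths(symbol_lengths, n):
    """Integer lengths ceil(ell(a_1) + ... + ell(a_n)) for every block a^n."""
    return {block: math.ceil(sum(symbol_lengths[a] for a in block))
            for block in product(symbol_lengths, repeat=n)}

def kraft_sum(lengths, alphabet_size=2):
    """Left-hand side of Kraft's inequality for the given codeword lengths."""
    return sum(alphabet_size ** (-length) for length in lengths)

def max_per_symbol_gap(symbol_lengths, n):
    """Largest gap between block length / n and the average of ell over the block."""
    blocks = block_code_lengths(symbol_lengths, n)
    return max(abs(length / n - sum(symbol_lengths[a] for a in block) / n)
               for block, length in blocks.items())

# Hypothetical real-valued code lengths; they satisfy (4) with equality.
ell = {"a": -math.log2(0.5), "b": -math.log2(0.3), "c": -math.log2(0.2)}

for n in (1, 2, 4, 8):
    lengths = block_code_lengths(ell, n)
    print(n, kraft_sum(lengths.values()), max_per_symbol_gap(ell, n))
```

Choosing $n > 1/\varepsilon$ then gives a block code within $\varepsilon$ of the non-integer code length function, which is only meaningful because single symbols are treated as prefixes of longer blocks.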

It is only possible to obtain a unique correspondence between
code length functions and (discrete) probability measures
by considering codewords as prefixes of potentially
longer codewords. If we restrict our attention to codewords of
some finite fixed length then Kraft's inequality does not give
a necessary and sufficient condition for decodability. As in
Bayesian statistics we focus on finite sequences. Contrary to
Bayesian statistics we should always consider a finite sequence
as a prefix of longer finite sequences. Contrary to frequentist
statistics we do not have to consider a finite sequence as a
prefix of an infinite sequence.

If the set of input symbols is not discrete one has to 
introduce some type of distortion measure, but we will abstain 
from discussing this complication in this short note. 

IV. Improper priors 

A. Finiteness structure 

If a sequence of length $m$ with average $\bar{x}$ is observed then
the prior integral $\int \exp(-m D(\bar{x} \| y))\, d\nu y$ is either finite or
infinite. Let $F_m$ denote the subset of average values in the convex
core such that the prior integral is finite for samples of size $m$.

Theorem 2. The sets $F_m$ form an increasing sequence of
convex subsets of the convex core, i.e. $F_1 \subseteq F_2 \subseteq F_3 \subseteq \dots \subseteq cc$.

Example 3. Consider the Gaussian location family. For this
family $D(y \| x) = \frac{(x-y)^2}{2}$. Assume that the prior has density
$\exp(ax^2)$. Then the prior can be normalized to a posterior
distribution when the integral

$$\int \exp(ax^2) \exp\left(-m \frac{(x-y)^2}{2}\right) dx$$

is finite, which is the case when $m > 2a$. If the prior has density
$\exp(x^4)$ then there exists no $m$ for which the prior can be
normalized.
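
The threshold in this example can be read off by completing the square. The following minimal sketch evaluates the Gaussian integral in closed form and returns infinity when the coefficient of $x^2$ in the exponent is not negative. The function name and the test values are illustrative assumptions.

```python
import math

def gaussian_conditional_integral(a, m, average):
    """Integral over the real line of exp(a x^2) * exp(-m (x - average)^2 / 2).

    The exponent equals -c x^2 + m * average * x - m * average^2 / 2
    with c = m / 2 - a, so the integral is finite exactly when c > 0.
    """
    c = m / 2.0 - a
    if c <= 0:
        return math.inf
    exponent = (m * average) ** 2 / (4 * c) - m * average ** 2 / 2
    return math.sqrt(math.pi / c) * math.exp(exponent)

# With a = 1 the prior exp(x^2) needs more than 2a = 2 observations.
for m in (1, 2, 3, 4):
    print(m, gaussian_conditional_integral(1.0, m, average=0.5))
```

For $a = 1$ the output is infinite for $m = 1, 2$ and finite from $m = 3$ on, in agreement with the condition $m > 2a$; the threshold does not depend on the observed average.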

Theorem 4. Assume that $\mu_1 \in F_m$ and $\mu_0$ is in the convex
core. Then $\left(1 - \frac{m}{n}\right)\mu_0 + \frac{m}{n}\mu_1 \in F_n$.

Proof: Let $\mu_s = \left(1 - \frac{m}{n}\right)\mu_0 + \frac{m}{n}\mu_1$. Then

$$D(\mu_s \| \mu) \geq \left(1 - \frac{m}{n}\right) D(\mu_0 \| \mu) + \frac{m}{n} D(\mu_1 \| \mu) - \ln(2) \geq \frac{m}{n} D(\mu_1 \| \mu) - \ln(2).$$

Hence

$$\int \exp(-n D(\mu_s \| \mu))\, d\nu\mu \leq 2^n \int \exp(-m D(\mu_1 \| \mu))\, d\nu\mu < \infty.$$

An important special case is when the convex core equals
$\mathbb{R}^d$. In this case we have that if $F_n \neq \emptyset$ then $F_{n+1} = \mathbb{R}^d$.
The next example shows that the theorem is 'tight'.

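Example 5. The family of exponential distributions has
$D(\lambda \| \mu) = \frac{\lambda}{\mu} - 1 - \ln\frac{\lambda}{\mu}$. Consider the prior density
$\exp(x^{-1}) \cdot x^{-2}$. The conditional integral is

$$\int_0^\infty \exp(x^{-1}) \cdot x^{-2} \exp\left(-m\left(\frac{\bar{x}}{x} - 1 - \ln\frac{\bar{x}}{x}\right)\right) dx.$$

The integral $\int_1^\infty \exp(x^{-1}) \cdot x^{-2}\, dx$ is finite so we only have
to consider the integral

$$\int_0^1 \exp(x^{-1}) \cdot x^{-2} \exp\left(-m\left(\frac{\bar{x}}{x} - 1 - \ln\frac{\bar{x}}{x}\right)\right) dx = \exp(m)\, \bar{x}^m \int_0^1 \exp\left((1 - m\bar{x})\, x^{-1}\right) \cdot x^{-m} x^{-2}\, dx.$$

The substitution $y = x^{-1}$ gives

$$\int_0^1 \exp\left((1 - m\bar{x})\, x^{-1}\right) \cdot x^{-m} x^{-2}\, dx = \int_1^\infty \exp\left((1 - m\bar{x})\, y\right) \cdot y^m\, dy.$$

We see that for $m \geq 1$ the integral is finite if and only if
$\bar{x} > 1/m$, which implies that $F_m = \,]1/m, \infty[$.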
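
The role of the threshold $1/m$ can be seen numerically. The sketch below is a rough illustration, assuming a simple trapezoid rule and hypothetical truncation points: it evaluates the integral of $\exp((1 - m\bar{x})\, y) \cdot y^m$ over $[1, \text{upper}]$ for growing upper limits. For $\bar{x} > 1/m$ the values stabilize, while at $\bar{x} = 1/m$ they keep growing.

```python
import math

def tail_integral(m, average, upper, steps=20000):
    """Trapezoid estimate of the integral of exp((1 - m*average) y) * y**m over [1, upper]."""
    h = (upper - 1.0) / steps

    def f(y):
        return math.exp((1.0 - m * average) * y) * y ** m

    total = 0.5 * (f(1.0) + f(upper))
    total += sum(f(1.0 + i * h) for i in range(1, steps))
    return total * h

m = 2
for average in (1.0, 0.5):        # 0.5 equals the boundary value 1/m
    print(average, [tail_integral(m, average, upper) for upper in (25, 50, 100)])
```

For $m = 2$ and $\bar{x} = 1$ the limit is $\int_1^\infty y^2 e^{-y}\, dy = 5/e$, whereas at $\bar{x} = 1/2$ the integrand reduces to $y^2$ and the truncated integrals grow like the cube of the upper limit.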

B. Existence of a prior 

In this section we will talk about a prior even when it cannot 
be normalized and we will call it a proper prior when it can 
be normalized to a probability measure. 

We will now define an exponential prediction system. We
consider a sequence of variables $X_1, X_2, \dots$ with values in $\mathbb{R}^d$.
For some sequences of outcomes $x^m$ a probability measure
$P(\cdot \mid x^m)$ on $\mathbb{R}^d$ is given, and the interpretation of this
probability measure is that it gives the probability or prediction of the
next variable $X_{m+1}$ given the values of the previous variables.
Equivalently we may think of $P(\cdot \mid x^m)$ as an instruction about
how the next variable should be coded given the values of the
previous variables. Further we will assume that if $P(\cdot \mid x^m)$ is
defined then $P(\cdot \mid x^n)$ is also defined for any sequence $x^n$ with
$x^m$ as prefix. Finally we will assume that the sum is sufficient
for prediction, i.e. $P(\cdot \mid x^m)$ only depends on the value of the
sum $x_1 + x_2 + \dots + x_m$.

An exponential prediction system as described above can be 
extended to a consistent prediction system for sequences and 
we note that the sum is still sufficient for predicting sequences. 
Conversely, a consistent prediction system for sequences can 
be reconstructed from its restriction to predictions of the next 
symbol. 

Assume that $P(\cdot \mid x^m)$ exists. Then we have a
consistent system of probability measures on the variables
$X_{m+1}, X_{m+2}, \dots$ for which the sums of the previous
variables are sufficient statistics for the following variables.
According to results of S. Lauritzen any such system is a mixture
of elements in an exponential family [8]. Therefore there exists
a measure $P_0$ and a probability measure $\nu_{x^m}$ over the convex
core such that

$$\frac{dP(\cdot \mid x^m)}{dP_0}(x) = \int_{cc} \frac{\exp(\hat\beta(y)\, x)}{Z(\hat\beta(y))}\, d\nu_{x^m} y.$$

These 'prior distributions' $\nu_{x^m}$ are updated to 'posterior
distributions' in the usual fashion

$$\frac{d\nu_{x^{m+1}}}{d\nu_{x^m}}(x) \propto \exp(-D(x_{m+1} \| x)).$$


Theorem 6. For an exponential prediction system there exists
an exponential family and a prior measure $\nu$ over the mean
value range $M$ of the exponential family such that

$$\frac{dP(\cdot \mid x^m)}{dP_0}(x) = \int_M \frac{\exp(\hat\beta(y)\, x)}{Z(\hat\beta(y))} \cdot \frac{\exp(-m D(\bar{x} \| y))}{\int_M \exp(-m D(\bar{x} \| y))\, d\nu y}\, d\nu y.$$

V. Jeffreys prior



A. Conditional regret 

We will use conditional regret to evaluate the quality of a
predictor. First we consider the case where the distributions are
discrete. Let $P^x$ denote a distribution in the exponential family
and compare it with a predictor $Q(\cdot \mid \cdot)$. If the distribution
$P^x$ is used to code the sequence $x^n$ then the code length is
$-\ln P^x(x^n)$. In order to code the same sequence using $Q(\cdot \mid \cdot)$
an initial string $x^m$ has to be known, and in this case the code
length for the rest of the sequence is $-\ln Q(x_{m+1}^n \mid x^m)$. The
regret is the difference

$$\mathrm{REG}(x^n \mid x^m) = -\ln Q(x_{m+1}^n \mid x^m) - (-\ln P^x(x^n)) = \ln \frac{P^x(x^n)}{Q(x_{m+1}^n \mid x^m)}.$$

If the distributions are not discrete the point probabilities have
to be replaced by densities with respect to some dominating
measure.

The conditional Jeffreys integral is defined as

$$(J \mid x^m) = \int_M \frac{\exp(-m D(P^{\bar{x}} \| P^x))}{V(x)^{1/2}}\, dx$$

where $\bar{x}$ is the sample average of $x^m$.
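
As a worked instance of this definition, consider the family of exponential distributions, where $V(x) = x^2$ and $D(\bar{x} \| x) = \frac{\bar{x}}{x} - 1 - \ln\frac{\bar{x}}{x}$. The unconditional Jeffreys integral $\int_0^\infty x^{-1}\, dx$ is infinite, but the substitution $u = m\bar{x}/x$ gives the conditional value $e^m m^{-m} \Gamma(m)$ for every $m \geq 1$. The sketch below compares this closed form with a direct numerical integration; the function names, the integration range and the step count are assumptions made for illustration.

```python
import math

def closed_form_conditional_jeffreys(m):
    """exp(m) * m**(-m) * Gamma(m), evaluated in the log domain."""
    return math.exp(m - m * math.log(m) + math.lgamma(m))

def numeric_conditional_jeffreys(m, average, steps=60000):
    """Midpoint rule in u = ln x for the integral of exp(-m D(average||x)) / x dx."""
    lo, hi = -30.0, 30.0
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * h
        ratio = average * math.exp(-u)              # average / x with x = exp(u)
        divergence = ratio - 1.0 - math.log(ratio)  # D(average || x)
        total += math.exp(-m * divergence)          # dx / x = du absorbs V(x)^(-1/2)
    return total * h

for m in (1, 2, 3):
    print(m, closed_form_conditional_jeffreys(m), numeric_conditional_jeffreys(m, 0.7))
```

In this family a single observation already makes the conditional Jeffreys integral finite, and its value does not depend on the observed average.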



B. Optimality of Jeffreys prior 

We are now able to combine our sequential approach with 
existing results on optimality of Jeffreys prior. 

Theorem 7. Assume that $(P^x)$ is an exponential family based
on the probability measure $P_0$ and let $Q(\cdot \mid \cdot)$ denote an
exponential prediction system based on the probability measure
$Q_0$ with prior measure $\nu$ on the mean value range. If
$Q_0 = P_0$ and the support of $\nu$ equals the closure of the mean
value range of the exponential family then for any $P^x$ in the
extended exponential family with $x$ in the convex core and any
sequence $x_1, x_2, \dots$ satisfying $\liminf n \cdot D(P^{\bar{x}} \| P^x) = \infty$,
where $\bar{x}$ denotes the average of $x^n$,
the conditional regret of the exponential prediction system
$Q(\cdot \mid \cdot)$ is eventually less than the conditional regret of $P^x$
with respect to the sequence $x_1, x_2, \dots$ Exponential prediction
systems based on $P_0$ and with dense prior are the only
exponential prediction systems satisfying this property.

Further conditions are needed in order to single out the 
Jeffreys prior. 

Theorem 8. An exponential prediction system is asymptotically
optimal with respect to minimax regret if and only if
it is based on Jeffreys prior. More precisely, there exists an
element $P^x$ in the exponential family corresponding to an
interior point $x$ in the convex core and a sequence $x_1 x_2 \dots$
with $\bar{x}_n \to x$ such that the regret of the exponential
prediction system satisfies

$$\liminf_{n \to \infty} \left( \mathrm{REG}(x^n \mid x^m) - \frac{k}{2} \ln \frac{n}{\tau} \right) \geq \ln(J \mid x^m).$$

Further, if the exponential prediction system is based on
Jeffreys prior, $P^x$ is an element in the exponential family
corresponding to an interior point $x$ in the convex core, and
$x_1 x_2 \dots$ is a sequence such that $\bar{x}_n \to x$, then

$$\limsup_{n \to \infty} \left( \mathrm{REG}(x^n \mid x^m) - \frac{k}{2} \ln \frac{n}{\tau} \right) \leq \ln(J \mid x^m).$$

This theorem has important consequences. For instance it
becomes much easier to prove the recent result that the SNML
predictor is exchangeable if and only if it is equivalent to the
use of Jeffreys prior [7].
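
The asymptotic statement can be illustrated in the simplest case. The sketch below assumes the Bernoulli family, where $k = 1$ and the Jeffreys integral $J = \pi$ is finite, so no initial string is needed ($m = 0$). It measures the regret of the Jeffreys mixture against the maximum likelihood distribution of the observed sequence, which is a simplification of the setting of the theorem, and all function names are hypothetical.

```python
import math

def jeffreys_mixture_log_prob(ones, n):
    """ln Q(x^n) for the Bernoulli Jeffreys mixture; depends only on the number of ones."""
    return (math.lgamma(ones + 0.5) + math.lgamma(n - ones + 0.5)
            - math.lgamma(n + 1) - math.log(math.pi))

def max_log_likelihood(ones, n):
    """ln of the maximum likelihood probability of a sequence with the given count."""
    total = 0.0
    if ones > 0:
        total += ones * math.log(ones / n)
    if ones < n:
        total += (n - ones) * math.log((n - ones) / n)
    return total

def regret_gap(ones, n):
    """Regret of the Jeffreys mixture minus (1/2) ln(n / tau) + ln(pi), tau = 2 pi."""
    regret = max_log_likelihood(ones, n) - jeffreys_mixture_log_prob(ones, n)
    return regret - 0.5 * math.log(n / (2 * math.pi)) - math.log(math.pi)

for n in (100, 10000, 1000000):
    print(n, regret_gap(3 * n // 10, n))   # averages converge to the interior point 0.3
```

The printed gaps tend to zero along sequences whose average stays at an interior point, in line with the role of $\ln J$ as the limiting constant.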



C. When is the conditional Jeffreys integral finite?

After having identified Jeffreys prior as optimal it is of 
interest to see how long sequences are needed before the 
conditional Jeffreys integral becomes finite. Most exponential 
families used in applications have finite conditional Jeffreys 
integral after just one sample point. For a one dimensional 
exponential family one can divide the parameter interval into 
a left part and a right part and treat these independently. The 
following results cover all cases relevant for applications. 

Theorem 9. Let $Q$ be a measure for which the convex core
is lower bounded. Assume that $a$ is the left end point of $M$. If
$Q$ has density $f(x) = (x - a)^{\gamma - 1} g(x)$ in an interval just to
the right of $a$, where $g$ is an analytic function and $g(a) > 0$,
then the conditional Jeffreys integral of the right truncated
exponential family is finite.

Grünwald and Harremoës have previously shown that under
the conditions of the previous theorem, if there is a point mass
in $a$ then the unconditional Jeffreys integral is also finite [9].



Theorem 10. Let $(\Gamma^{can}, Q)$ represent a left-truncated exponential
family that is light tailed in the sense that there exists
a gamma exponential family such that the variance function
$V$ of $(\Gamma^{can}, Q)$ satisfies

$$\liminf_{x \to \infty} \frac{V(x)}{V_\gamma(x)} > 0.$$

Then the conditional Jeffreys integral is finite. Here $V_\gamma(x)$
denotes the variance function of the gamma exponential family.

Proof: The gamma exponential families have finite conditional
Jeffreys integral. We use the formula

$$D(P^\mu \| P^\nu) = \int_\mu^\nu \frac{x - \mu}{V(x)}\, dx$$

and the formula for the conditional Jeffreys integral

$$\int_M \frac{\exp(-m D(P^{\bar{x}} \| P^x))}{V(x)^{1/2}}\, dx$$

to conclude that a larger variance function leads to a smaller
Jeffreys integral. ■
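
The integral formula for the divergence in terms of the variance function, $D(P^\mu \| P^\nu) = \int_\mu^\nu \frac{x - \mu}{V(x)}\, dx$, can be checked against known closed forms. The sketch below is a minimal numerical check, assuming the Poisson family ($V(x) = x$) and the exponential distributions ($V(x) = x^2$); the function names and the midpoint rule are illustrative choices.

```python
import math

def divergence_from_variance(mu, nu, variance, steps=20000):
    """Midpoint rule for the integral of (x - mu) / V(x) from mu to nu."""
    h = (nu - mu) / steps          # negative when nu < mu, which keeps the sign right
    total = 0.0
    for i in range(steps):
        x = mu + (i + 0.5) * h
        total += (x - mu) / variance(x)
    return total * h

def poisson_divergence(mu, nu):
    """Closed form D between Poisson distributions with means mu and nu."""
    return mu * math.log(mu / nu) - mu + nu

def exponential_divergence(mu, nu):
    """Closed form D between exponential distributions with means mu and nu."""
    return mu / nu - 1.0 - math.log(mu / nu)

print(divergence_from_variance(2.0, 5.0, lambda x: x), poisson_divergence(2.0, 5.0))
print(divergence_from_variance(2.0, 0.5, lambda x: x * x), exponential_divergence(2.0, 0.5))
```

Since the variance function enters only through the denominator of the integrand, a pointwise larger $V$ gives a smaller divergence between the same pair of mean values.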
The following theorem extends a theorem from [9].

Theorem 11. Let $(\Gamma^{can}, Q)$ represent a left-truncated exponential
family such that $\beta_{sup} = 0$ and $Q$ admits a density $q$ either
with respect to Lebesgue measure or counting measure. If $q$ is
heavy tailed the Jeffreys integral is finite if and only if all the
conditional Jeffreys integrals are finite. If $q(x) = O(1/x^{1+\alpha})$
for some $\alpha > 0$, then the Jeffreys integral $\int_M V(x)^{-1/2}\, dx$ is finite.

Proof: Assume that $q$ is heavy tailed. Grünwald and Harremoës
have shown [9] that in this case $\sup_{y > x} D(Q^x \| Q^y) <
\infty$, which implies that the factor $\exp(-m D(Q^x \| Q^y))$ in the
integrand of the conditional Jeffreys integral is lower bounded.
The proof of the second half of the theorem follows directly
from [9]. ■

Most exponential families with finite minimax regret also
have finite Jeffreys integral, but there are counterexamples, and they
give exponential families for which the Jeffreys integral is
always infinite.

Example 12. If $Y$ is a Cauchy distributed random variable
then $X = \exp(Y)$ has a very heavy tailed distribution that we
will call an exponentiated Cauchy distribution. A probability
measure $Q$ is defined as a 1/2 and 1/2 mixture of a point
mass in 0 and an exponentiated Cauchy distribution. As
shown by Grünwald and Harremoës [9] this distribution has
finite minimax regret but infinite Jeffreys integral. Hence, the
conditional Jeffreys integrals are all infinite.

VI. Discussion 

The notion of sufficiency has been generalized by S. Lauritzen
[8], and generalizations of his results to the setting
presented here are highly relevant but cannot be covered in
this short note. In this short note the purpose of Theorem 8 was
to uniquely identify exponential prediction systems based on
Jeffreys prior, but using the information topology [10] sharper
versions can be formulated. In cases where the Jeffreys integral
is infinite and the minimax regret is finite one cannot find
an optimal exponential prediction system, so exchangeability
cannot be achieved. In such cases the usual NML predictor or
the SNML predictor may be good alternatives. Much of what
has been said here about minimax regret $C_\infty$ will also hold
for minimax redundancy $C_1$ or for any capacity of order $\alpha$,
denoted $C_\alpha$ [11].